
Conversation

@andrewstrohman
Contributor

@andrewstrohman andrewstrohman commented Nov 11, 2025

When a pointer to a BTF struct is NULL, every type of matchArgs selector should fail to match, because the argument was not really resolved.

When we fail to dereference during resolve, report the "depth" at which the failure occurred, and expose this in the event, so that users can distinguish between bogus and legitimate arg values.

When investigating how to plumb this to the event, I found a bug regarding how returnCopy works. I've added a fix for that as well.

Contributor

@tdaudi tdaudi left a comment


Hello !

Thank you for taking the time to write this PR.
I opened an issue for this a few months ago; see #3728.

I don't know if you will share my point of view, but while thinking about this I found three problems:

  • First, what happens if the NULL structure is not the last field in the resolve chain? How can you track where it ends?
  • Also, what happens if you are resolving int *value and the pointer is NULL?
  • And if you are resolving a struct (e.g. type: file), you need to let extract_arg finish correctly, because at this point you have no idea whether extract_arg returned a legitimate 0 or a NULL struct. Only when resolving its fields will you figure out that it is actually NULL.

The potential solution I was thinking of was to take all the return values of probe_read and find a way to send the error to userspace when it is < 0. In the case of a NULL pointer, you could use a new error code that the kernel does not use, for instance -100, dedicated to it.

return 1;
/* NULL pointer */
if (*data->arg == 0) {
*data->null_ptr_found = true;
Contributor


If the last btf_config resolves to a NULL pointer, you would not reach that code, right?

Contributor Author


Yes, this is a good point. I think my new placement for error detection takes care of this problem.

- index: 1
  type: "uint8"
  btfType: "mystruct"
  resolve: "sub.v8"
Contributor


I think two test cases should be added. This one is good, but if sub alone is resolved, does it raise an error too? For instance, if type: file is NULL, does Tetragon display null as well?

Contributor Author

@andrewstrohman andrewstrohman Nov 14, 2025


I think two test cases should be added.

You go on to describe what sounds like one test case. What is the other test case you're thinking about here?

but if sub alone is resolved, does it raise an error too? For instance, if type: file is NULL, does Tetragon display null as well?

As I mentioned in #3728, I think that we should break this up into two independent tasks. One is to handle the dereferencing in extract_arg_depth, and the other is to handle issues for the terminal types reached by resolve (in read_arg), which would happen even without the use of the resolve feature. With pointer types, read_arg() has some issues when the pointer is NULL. Because of this, I don't want to test scenarios where we resolve to a NULL pointer and pass that NULL pointer to read_arg() (yet).

Also, for the example you have above:

what happens if you are resolving int *value and the pointer is NULL?

From my testing, I see that read_arg() is passed a pointer to an int in this scenario. If we want to resolve the int value that the pointer points to, I think we need to call extract_arg_depth one more time in order to dereference, because read_arg() takes the integer "by value" instead of by pointer to an integer. So, I don't want to test this scenario until we make this work in the non-NULL case.

So, I think that I should add a test for when the first pointer is NULL, and when the second to last pointer is NULL, but I don't want to test the scenario where the last pointer is NULL, because that problem is not specific to resolve, so I want to solve that issue independently.

@netlify

netlify bot commented Nov 12, 2025

Deploy Preview for tetragon ready!

Name Link
🔨 Latest commit a32769e
🔍 Latest deploy log https://app.netlify.com/projects/tetragon/deploys/69339a892a15b60008dfa5d3
😎 Deploy Preview https://deploy-preview-4327--tetragon.netlify.app

@andrewstrohman
Contributor Author

andrewstrohman commented Nov 13, 2025

Thank you for taking the time to write this PR. I opened an issue for this a few months ago; see #3728.

Thanks for bringing this to my attention, and your feedback/ideas here.

  • First, what happens if the NULL structure is not the last field in the resolve chain? How can you track where it ends?

I've changed things so that I keep track of the level of depth at which we are unable to dereference. I use a sentinel value of -1 to indicate that no issues were detected. This is then used as an indicator that selectors against this arg shouldn't match. I'm thinking that perhaps this failure-depth information could be put in the event, not only to distinguish resolve failure from success, but also to show at what depth the failure occurred.

  • Also, what happens if you are resolving int *value and the pointer is NULL?

I've investigated this, and it seems that this scenario doesn't work even in the non-NULL case. So, I want to fix that first, before considering how to handle the NULL case. This is what I'm trying:

struct mystruct {
	uint8_t  v8;
	uint16_t v16;
	uint32_t v32;
	uint64_t v64;
	struct mysubstruct sub;
	struct mysubstruct *subp;
	uint32_t *v32p;
};

In the userspace program I'm testing against, I do this:

struct mystruct s = {0};
uint32_t value = 3;
s.v32p = &value;

The policy's arg config:

    - index: 1
      type: "uint32"
      btfType: "mystruct"
      resolve: "v32p"

Is this the scenario you're describing here?

It seems like we might need to do one more iteration of calling extract_arg_depth and dereference again to make this work as expected. Put another way, I think that we are acting on a pointer as if it's an integer.

  • And if you are resolving a struct (e.g. type: file), you need to let extract_arg finish correctly, because at this point you have no idea whether extract_arg returned a legitimate 0 or a NULL struct. Only when resolving its fields will you figure out that it is actually NULL.

I'll look into this more closely, but from reading the code it seems that we don't handle the type: file NULL case correctly, even outside of resolving. At first glance, I'm not seeing checks for probe_read() return values even though this type requires dereference. Overall, I'm thinking that my strategy is to fix issues in read_arg() where we should be watching the return value of probe_read() (types that require dereference), if they exist, independently from catching dereference problems in extract_arg_depth. I think the combination of these two methods will cover all bases. What do you think about this approach?

The potential solution I was thinking of was to take all the return values of probe_read and find a way to send the error to userspace when it is < 0.

Thanks for this idea. I think this is the right approach, and I have started moving in this direction.

In the case of a NULL pointer, you could use a new error code that the kernel does not use, for instance -100, dedicated to it.

I'm not sure I follow this. I think we need to communicate the error out of band of the arg value, or else we will have a collision. Perhaps we can report the depth at which dereferencing failed in the event, in order to distinguish between resolve success and failure.

I'm trying to figure out what the right behavior is for when we cannot resolve due to a NULL pointer. My current approach is to indicate there was a resolve error, but not let that prevent the event from firing, unless there is a matchArgs selector associated with the arg. In this method, I would communicate in the event, somehow, that the arg was not actually resolved. But I realized from the issue that you created about this that NULL string pointers prevent an event from firing, regardless of selectors. So maybe that's a more appropriate outcome.

@kkourt
Contributor

kkourt commented Nov 13, 2025

I'm trying to figure out what the right behavior is for when we cannot resolve due to a NULL pointer. My current approach is to indicate there was a resolve error, but not let that prevent the event from firing, unless there is a matchArgs selector associated with the arg. In this method, I would communicate in the event, somehow, that the arg was not actually resolved.

Above makes sense to me.

But I realized from the issue that you created about this that NULL string pointers prevent an event from firing, regardless of selectors. So maybe that's a more appropriate outcome.

I don't think that's the appropriate outcome.

I'll use the example from the issue for clarity:

  lsmhooks:
  - hook: "bprm_check_security"
    args:
    - index: 0
      type: "string"
      resolve: "executable.f_path.dentry.d_name.name"
    selectors:
    - matchActions:
      - action: Post

In this case (that is, where there are no matchArgs), I think we should generate an event (perhaps with an empty argument, or even a special value to indicate that some resolution went wrong).

@andrewstrohman andrewstrohman force-pushed the pr/andrewstrohman/resolve-null branch 17 times, most recently from 765d55d to f56102b Compare November 18, 2025 23:32
@andrewstrohman andrewstrohman force-pushed the pr/andrewstrohman/resolve-null branch 5 times, most recently from e900a2c to 5bcb17c Compare November 21, 2025 00:27
@andrewstrohman andrewstrohman marked this pull request as ready for review November 21, 2025 01:32
@andrewstrohman andrewstrohman requested a review from a team as a code owner November 21, 2025 01:32
@olsajiri olsajiri requested review from olsajiri and tdaudi and removed request for tdaudi November 21, 2025 16:56
@andrewstrohman andrewstrohman marked this pull request as draft November 21, 2025 21:08
@andrewstrohman
Contributor Author

andrewstrohman commented Nov 21, 2025

I'm converting back to draft, as I just realized a problem with my approach. I need to think about tracepoints and usdt a bit more.

@olsajiri
Contributor

hi, just some thoughts for discussion.. I don't feel strongly about it, I'm not sure what the right solution is ;-)

do we need that error-depth number in the final event? I wonder if, when you see it in the event, in most cases you'll know what failed.. in which case using some flag instead would be enough?

and if there's reason to keep it (I guess long derefs would justify that) could we maybe store the part of the deref chain (string) that failed instead of depth (we have that deref chain info already)

and how about using metric instead, something like deref_fails_cnt[policy,sensor,arg-idx,deref-chain] .. as I'm not sure this info should be part of the final event.. but I could be easily wrong

@andrewstrohman andrewstrohman marked this pull request as ready for review November 22, 2025 02:10
@andrewstrohman
Contributor Author

hi, just some thoughts for discussion.. I don't feel strongly about it, I'm not sure what the right solution is ;-)

Thanks for the review and helping me think this through.

do we need that error-depth number in the final event? I wonder if, when you see it in the event, in most cases you'll know what failed.. in which case using some flag instead would be enough?

I'd be OK with just a boolean flag that indicates that resolving failed. My primary goal here is to indicate that the argument value is bogus. I can't just remove the bogus arg value from the event, because that would shift some arg values to the left of what was configured in the spec, causing a mismatch with expectations.

I think I went down this path of recording which dereference failed, because of @tdaudi's earlier comment. He said:

First, what happens if the NULL structure is not the last field in the resolve chain? How can you track where it ends?

@tdaudi, did I understand your comment correctly? Is this something that is important to you? If this is not what you meant, or you don't feel strongly about this, I think we should just go with a boolean flag. I initially thought this might be nice to expose, but it's not important to me.

If users really want to know where the NULL dereference happened, they can resolve the intermediate pointers to see which one was NULL, if they have the args to spare (we have a max of 5). This way, they can also run a selector against it too, whereas just reporting the depth does not allow them to filter based on it.

and if there's reason to keep it (I guess long derefs would justify that) could we maybe store the part of the deref chain (string) that failed instead of depth (we have that deref chain info already)

When you say "long derefs" here, I think you mean multiple dereferences per resolve. Please let me know if I misunderstood. I agree that if we decide to pinpoint the exact dereference that failed, it would be better to go all the way and show where it failed via a prefix of the resolve configuration string. Just reporting the depth requires the user to know too much (and do extra investigation) in order to interpret it in a useful way. But I'm currently leaning toward just changing it to be a boolean.

and how about using metric instead, something like deref_fails_cnt[policy,sensor,arg-idx,deref-chain] .. as I'm not sure this info should be part of the final event.. but I could be easily wrong

When you say "this info", I think you mean the depth of dereference failure here. I think we agree that we need some signal in the event that indicates that the arg value is bogus. Please correct me if I misunderstand your position here.

I would prefer your first or second suggestion over this, because the user cannot definitively know which event the counter increment was related to.

When we fix read_arg(), so that NULL pointers won't prevent events (see here for background), we can use this same boolean flag to indicate the dereference failure, unifying the two sources of dereference failures.

@tdaudi
Contributor

tdaudi commented Nov 22, 2025

@tdaudi, did I understand your comment correctly? Is this something that is important to you? If this is not what you meant, or you don't feel strongly about this, I think we should just go with a boolean flag. I initially thought this might be nice to expose, but it's not important to me.

If users really want to know where the NULL deference happened, they can resolve the intermediate pointers to see which one was NULL, if the they have the args to spare (we have a max of 5). This way, they can also run a selector against it too, whereas just reporting the depth does not allow them to filter based on it.

My initial idea was to have a convenient way to know which field in the resolve path is NULL, to avoid spending time on updating the policy and testing again. But for production, I think it is interesting to know where the resolve fails when the NULL value is unexpected. Knowing where the NULL appears helps in understanding the reason and avoids the difficulty of having to reproduce it, especially if it occurs rarely.

a = (&e->a0)[arg_index];

extract_arg(config, index, &a);
extract_arg(config, index, &a, &e->resolve_err_depth[index]);
Contributor


so if we fail to resolve we still continue and store the 'unresolved data' which is likely null, right? I think we can skip it

but how about we make this more generic and we start each argument data with '__u32 error' status and in case of failure we write just error != 0 as argument value.. in case of resolve failure we encode the depth into it

then on user space side the getArg would just read first uint32 from event reader to get the argument status
so all the extra retrieval of error depth from each probe type would not be needed, also probably the argument index will be clear, because you have the error for argument you are about to read, or skip reading if it's != 0

we could use this to store other errors that happen during argument encoding, which there are plenty

and into final event instead of the depth value, I'd add error string that in case of resolve failure would contain something like: "failed to resolve current->file->f_inode"

seems like this could save some cycles and prevent bogus arguments values, because IIUC if we bail from argument storing in the middle on kernel side the user side still tries to read the whole part

Contributor Author

@andrewstrohman andrewstrohman Nov 25, 2025


so if we fail to resolve we still continue and store the 'unresolved data' which is likely null, right? I think we can skip it

I think "skip it" means skipping calling read_arg for the arg. This way, e->args isn't filled with the bogus value.

but how about we make this more generic and we start each argument data with '__u32 error' status and in case of failure we write just error != 0 as argument value.. in case of resolve failure we encode the depth into it

OK, so each argument value in e->args is preceded by a u32 that holds the depth of failure, or 0 if no failure occurred. If the u32 is non-zero, no bogus arg value follows -- the next bytes in e->args are about the subsequent arg.

we could use this to store other errors that happen during argument encoding, which there are plenty

I think these other errors mean passing NULL pointers to read_arg() (which are problems that pre-dated resolve functionality). Is that what you're thinking here or something else?

and into final event instead of the depth value, I'd add error string that in case of resolve failure would contain something like: "failed to resolve current->file->f_inode"

Yes, I agree that adding a string that shows where we failed to dereference is better than just exposing the depth of the error. So, from an implementation perspective, this means that instead of adding an int32 resolve_err_depth member to KprobeArgument, I would add a string member/attribute to KprobeArgument that describes the error that happened when reading the arg value. If the string is empty, it means there was no error reading the arg. See below for another possible way to interpret this. Regardless, in userspace, I would use the resolve failure depth sent from the bpf program to determine where the NULL pointer was and only expose that to the user.

seems like this could save some cycles and prevent bogus arguments values, because IIUC if we bail from argument storing in the middle on kernel side the user side still tries to read the whole part

In your suggested approach, I like that we don't send a bogus value from bpf to userspace. I like that the argument read outcome sits directly in front of the argument value in e->args, so that they are coupled and less likely to get mixed up.

What should getArg() return when it sees a non-zero u32 preceding the value it wants to read? Should I create something like:

type MsgGenericKprobeArgFailed struct {
	Index uint64
	Value string
	Label string
}

In this approach, there wouldn't be an error string associated with every KprobeArgument. The error string would only be associated with KprobeArgument_ArgFailed. I'm not sure if there would be a problem if the user expected to receive a KprobeArgument_UintArg, for example, in a specific position within the args list, but instead found type KprobeArgument_ArgFailed. If you don't see an issue with this way of communicating the failure, I think it's the cleanest method, because we don't need to fabricate a bogus arg value in userspace as a placeholder.

While I like not transmitting a bogus argument to userspace within e->args, and I like not mixing up the arg read outcomes, I am worried that this approach will require changes in bpf programs which will result in their instruction count exceeding 4k (on kernels that don't support __LARGE_BPF_PROG).

While experimenting with what I currently have, I found that in many code paths, I didn't have any instructions to spare for 4.19. I couldn't even check for an int32 being 0 without crossing the 4k boundary. In the current approach, I was able to avoid adding any new instructions in the mostly-full programs by completely isolating the change from 4.19 with __LARGE_BPF_PROG. But with this proposal, I think I'll need to make 4.19 aware of the uint32 value that precedes each arg, so I worry that I won't be able to make it fit. I'll keep a close eye on this and let you know if I hit this 4k instruction count problem and cannot get around it.

Contributor


I think these other errors mean passing NULL pointers to read_arg() (which are problems that pre-dated resolve functionality). Is that what you're thinking here or something else?

I was thinking about it more like a general approach for read_arg plus the resolve status, but your point about 4.19 seems valid.. I guess we could start with just the resolve error if the rest won't fit in.. perhaps there's a way to split read_arg into 2 tailcalls .. on the other hand it's just a 4.19 issue, and I wonder how long that will be around

or we make it just for __LARGE_BPF_PROG which might mean bit uglier code for user space, but I guess we have to try first to see ;-)

I'm not sure if there would be a problem if the user expected to receive a KprobeArgument_UintArg, for example, in a specific position within the args list, but instead found type KprobeArgument_ArgFailed.

the KprobeArgument_ArgFailed makes sense to me, I think it's ok to find a KprobeArgument_ArgFailed where a KprobeArgument_UintArg was expected, because we were not able to read the expected type anyway.. @kkourt any idea about that?

Contributor


Is the idea to have a KprobeArgument_ArgFailed type for arguments that we failed to read (e.g., in the case of a resolve that hit a NULL value)? If so, makes sense to me.

Contributor Author


Yeap, that's the idea. Thanks for the feedback.

@andrewstrohman andrewstrohman force-pushed the pr/andrewstrohman/resolve-null branch 4 times, most recently from 957b438 to 8affc9d Compare December 6, 2025 07:33
Use the configuration to determine if we are examining a valid argument.
Commit 8dea69f ("tetragon: Read only defined arguments") made it so
that we only read arguments defined in the policy. So, we need to adapt
the rate limiting logic to account for this. Otherwise, we might put
garbage in the key.

Signed-off-by: Andy Strohman <[email protected]>
We have been ignoring errors returned from bpf_probe_read() when
dereferencing during resolve. bpf_probe_read() returns a negative value
when it detects that a seg fault would occur. This is problematic both
in terms of exposing bogus argument values in the event and also when
performing filtering on behalf of the matchArgs selector.

This change notes when we were unable to dereference so that arg
filtering does not happen against a bogus value. For each arg, we set
the depth at which the dereference failed in a status header that
precedes the argument value. If the status value is 0, then no
error occurred and the argument's value follows. Any non-zero status
indicates failure reading the arg value. When userspace encounters this
non-zero status, it will not attempt to read the arg's value. This allows
us to avoid transmitting bogus argument values from the bpf program to
userspace.

Signed-off-by: Andy Strohman <[email protected]>
This arg value is found in the event instead of a configured arg, when
the configured arg could not be read. It includes a message that
describes what error occurred while attempting to read the arg value.

Signed-off-by: Andy Strohman <[email protected]>
The previous commit altered api/v1/tetragon/tetragon.proto.

This commit reflects the outcome of running "make protogen", which is
required in order to generate code/documentation based on that change.

Signed-off-by: Andy Strohman <[email protected]>
This program will be used to test how resolve handles NULL pointers.

Signed-off-by: Andy Strohman <[email protected]>
Test that the resolve error depth is expected when NULL pointers are
encountered.

Test that matchArgs will not match when resolve fails.

Signed-off-by: Andy Strohman <[email protected]>
With this change, GetIndex() has the same semantic meaning regardless of
hook type. It returns the index of the Arg within the spec.

This fixes a returnCopy issue where we were overwriting the wrong arg
when we merge the arg value after the retprobe event.

Signed-off-by: Andy Strohman <[email protected]>
This test confirms that we don't confuse the meaning of index. Arguments
have a position (index) within the args section of a tracing policy
spec. The defined arg's position within the spec is not related to
its position (index) within the function signature or tracepoint that
is being hooked.

We had a bug where we confused the meaning of index. Retprobes need to
overwrite a bogus argument value from the event that corresponded to
function entry. We locate the argument which needs to be overwritten by
referencing its position within the spec. The bug was overwriting based
on the arg's position within the function signature.

This returnCopy test is constructed such that the configured argument's
index within the spec does not correspond to the argument's position
within the function signature.

Signed-off-by: Andy Strohman <[email protected]>
To a string that shows which pointer could not be dereferenced.

Signed-off-by: Andy Strohman <[email protected]>